
Supplemental material for launching an FL experiment

This is a supplemental document to the submit-fl-jobs.ipynb Jupyter notebook. While we tried our best to include as much as possible in the .ipynb, it is not a good environment for screenshots. We include them in this document and elaborate more on the outputs and possible problems.

Initial setup and editing

The first cell, which installs the packages, only needs to be executed during the first run or when changing the version of nvflare. It installs the packages on the executor (where the .ipynb runs), not on the federated server.

%pip install azure-ai-ml
%pip install nvflare==2.3.0
%pip install azureml-mlflow mlflow

It is important to note that for any OS-level operation that doesn't depend on Python, we can prefix the command with the ! symbol. However, when we want to execute a command only for the current kernel (for example the pip install), we need to prefix it with the % sign instead. A ! command installs the package only for the default kernel (currently one of the Python 3.8 kernels).

Regarding the kernel mechanism: a kernel really is just a conda environment turned into a runtime in which the script is executed. You can pick it in the upper right part of the screen. To install packages, you can either use the cells with %, or open the terminal, activate the right conda environment and install the packages yourself.
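
As a quick illustration of the difference, the two variants below install the same package into different places (the package and version are simply the ones used above):

# installs into the kernel this notebook is currently running on
%pip install nvflare==2.3.0

# installs only into the default kernel, which may not be the kernel selected for this notebook
!pip install nvflare==2.3.0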

The second cell is full of imports. These are again needed by the .ipynb and its executor. Packages that are only used by the containers are not needed here, hence the lack of PyTorch, for example.
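
For orientation, the imports in that cell typically look roughly like the following; this is only a hedged sketch of the executor-side dependencies (azure-ai-ml, azure-identity and the NVFlare admin API), not the exact contents of the notebook:

import os

# AML SDK v2, used on the executor side to define jobs and environments
from azure.ai.ml import MLClient, Input, Output, command
from azure.ai.ml.entities import Environment
from azure.identity import DefaultAzureCredential

# NVFlare admin API runner, used later to submit and monitor FLARE jobs
from nvflare.fuel.hci.client.fl_admin_api_runner import FLAdminAPIRunner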

The Set Variables section needs to be adjusted based on the project you are running. Let's make a breakdown (an illustrative example cell follows the list):

compute_name - This is the name of the compute instance created during the first run, which runs the .ipynb. You can find the name in the panel above the .ipynb.

Fig. 1 - Compute name

user_name - Name of the user you are using within the workspace. If you open the Notebooks tab, your name should be at the top of the hierarchy, so simply write it here

flare_root - Location of project.yml and custom code, relative to this .ipynb

yml_config - Name and extension of the yml file used for NVFlare project setup and customization

subscription_id and resource_group - Usual suspects, same as before

experiment - Name under which the experiment will be tracked within AML

project_name - This must correspond to the name given to the folder where the project is saved

server_url - Fully qualified domain name of the server. You need a static URL for the server in order to conduct experiments.

target_server - Name of the attached compute to be used

server_workspace - Name of the workspace where the attached compute is used
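
For illustration, a filled-in Set Variables cell might look like the following; every value below is a placeholder and must be replaced with your own:

compute_name = "my-compute-instance"      # compute instance running this .ipynb
user_name = "jane.doe"                    # your user folder in the Notebooks tab
flare_root = "dev/nvflare"                # location of project.yml and custom code
yml_config = "project.yml"                # NVFlare project configuration file
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
experiment = "fl-experiment"              # experiment name tracked in AML
project_name = "my_fl_project"            # must match the project folder name
server_url = "flserver.example.com"       # static FQDN of the Flare server
target_server = "fl-server-compute"       # attached compute used as the server
server_workspace = "fl-server-workspace"  # workspace containing that attached compute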

We can now move down through the Provision Flare Startup Kits section. At the bottom of the last (3rd) cell, there is an option to adjust which projects are copied to the admin. If you want it to be all of them, just change "$flare_root/$project_name" to "$flare_root/."

The next cell, under "Check Compute Targets and Flare Startup Kits", just checks whether provisioning has worked for all the clients.

Afterwards, we define a client environment. In our case, this is a Docker container created from a prebuilt image by Microsoft, with a conda file which sets up pip and installs the pip packages. You can customize this a bit, for example by providing your own BuildContext for Docker; for more information, please see the documentation. In this section, you might want to adjust the base image, the conda_file path or the name of the environment. This environment is afterwards registered in all the workspaces.
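
A hedged sketch of what such an environment definition can look like with the azure-ai-ml SDK (the image, paths and names below are placeholders, not the notebook's actual values):

from azure.ai.ml.entities import Environment

client_env = Environment(
    name="nvflare-client-env",                                   # placeholder environment name
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",  # example prebuilt Microsoft base image
    conda_file="environment/conda.yml",                          # conda file that pulls in the pip packages
    description="Client environment for NVFlare jobs",
)

# registration is then repeated against each workspace's MLClient, e.g.
# ml_client.environments.create_or_update(client_env)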

We can now skip over the "Start Flare Server in Container" settings, unless you changed the ports from the NVFLARE defaults (8002 and 8003). In that case, you want to adjust the docker_args within the command.
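
If you did change the ports, the adjustment typically amounts to the port mapping passed to Docker, roughly along these lines (the values are illustrative, not the notebook's exact string):

# default NVFLARE ports
docker_args = "-p 8002:8002 -p 8003:8003"

# with custom ports, change the mapping accordingly, e.g.
# docker_args = "-p 9002:9002 -p 9003:9003"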

Now jumping to "Submit Deployment Packages to Flare Clients", there are several things to adjust:

cmd_symlink - If you are using a PV/PVC with Kubernetes, you must link your AzureML mount path (specified in the PVC, under ml.azure.com/mountpath) instead of /home/fldata. You can of course remove this symlink and use the whole path inside the code, but the simplification can be convenient. A hedged sketch follows this list.

environment - Name of the client environment, specified above in "Define Client Target Environment"

experiment_name, display_name, description - Adjust according to what you want to see in AML
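
The sketch below is purely illustrative of the cmd_symlink idea (the PVC mount path and the link name are placeholders; check the notebook's actual default before changing it):

# default: expose the local data directory under a short, stable path
cmd_symlink = "ln -s /home/fldata fldata"

# with Kubernetes PV/PVC, link the AzureML mount path from the PVC instead
# cmd_symlink = "ln -s /mnt/azureml/my-pvc-mountpath fldata"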

Now we can jump over one cell and end up in the "Submit NVIDIA FLARE Job" section, "Initialize APIRunner and check FLARE clients" subsection. What we want to change here essentially depends on what your project_admin is called within your specified NVFLARE project yaml. If it is the same as in the script, you don't need to change anything; otherwise you need to adjust the command with the cd line and both lines in FLAdminAPIRunner.
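
For reference, a hedged sketch of the runner construction the above refers to; the admin name and the startup-kit path are placeholders and must match your project.yml and provisioned workspace:

from nvflare.fuel.hci.client.fl_admin_api_runner import FLAdminAPIRunner

# username must be the project_admin defined in your NVFlare project yaml
runner = FLAdminAPIRunner(
    username="admin@azure.com",                                    # placeholder admin name
    admin_dir="workspace/example_project/prod_00/admin@azure.com", # placeholder admin startup-kit path
)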

Actually run the job

If you ran the job before, make sure that all AML jobs on the machines are stopped and that the server has been restarted and is running.

Now that we went through almost the whole script and adjusted things together, it is time to actually train. If everything is set up properly, you can scroll all the way down to the "Initialize APIRunner and check FLARE clients" subsection, click on the downward-pointing arrow on the left (not the run icon) and select "Run all above this cell".

Fig. 2 - Location of the button

If you have provisioned the kits before, made manual changes to them and want to keep that setup, it may be worth manually skipping the Provision Flare Startup Kits section, as it would remove the changes and recreate the kits.

Now assume you did the whole run. What will happen? Well, a lot of things. Obviously, all the cells should finish with a green tick. Some prompts may appear at the top right of the screen, such as:

Fig. 3 - Prompts about job submit

Also, the show_azureml_jobs cell should highlight some jobs in the Starting or Preparing status, such as:

Fig. 4 - Overview of running jobs

Now we can navigate back to the Jobs tab, where we should see all the jobs. They can be in several states after submission:

  1. Preparing - Happens after the environment or container has been modified. The container needs to be rebuilt, which may take several minutes depending on its size. If no changes have been made to the environment since the last preparation, this step is skipped or lasts only seconds.

  2. Queued - The job has everything it needs in order to run; it is now pulling the container or waiting for the machine to have sufficient resources. In this phase you may encounter an error stating that there are insufficient resources to allocate the pod. You likely either specified the wrong instance specifications when setting up the cluster, or something running on the machine is blocking the necessary resources.

  3. Running - This indicates the job arrived at the machine intact and is executing. For FL, Running doesn't tell you whether the job itself is still running or has already crashed.

  4. Failed - The job has failed for some reason, most likely due to a code error or problems with the virtual machine itself.

  5. Cancelled - Job has been cancelled by a user.

The current phase of the job can be seen after clicking on the job itself. We need to wait until all the clients and the server are in the Running state.

Fig. 5 - Different states of jobs

Submitting NVFLARE job

Now that we have all the machines in the Running state, we can trigger the "Submit NVIDIA FLARE Job" part of the notebook. The first cell initializes the APIRunner and checks the FLARE clients. We expect it to report back the individual clients we use and the number we expect to participate in the experiment:

{
    "server_engine_status": "stopped",
    "status_table": [
        [
            "CLIENT",
            "TOKEN",
            "LAST CONNECT TIME"
        ],
        [
            "azure-gpu-machine",
            "d984d3ca-2a63-448a-becd-3cfba0b52695",
            "Mon May 22 13:29:24 2023"
        ],
        [
            "hospital-vishnu",
            "2b5b66a8-09a1-4a49-9a5b-a3faf9b99169",
            "Mon May 22 13:28:59 2023"
        ],
        [
            "hospital-brahma",
            "4a1a975a-07dc-4883-b41c-b838ac47685d",
            "Mon May 22 13:29:16 2023"
        ],
        [
            "hospital-shiva",
            "cade2090-4ff5-46e1-8f65-9419084cd0bd",
            "Mon May 22 13:29:22 2023"
        ]
    ],
    "registered_clients": 4
}

If the number of clients is smaller than the number needed for the experiment, or than the number you expect to be connected, you need to take a look at the jobs of the missing clients.
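
If you need to re-run this check manually, the cell essentially boils down to a check_status call on the admin API; a sketch, assuming the runner object constructed earlier:

from nvflare.fuel.hci.client.fl_admin_api_spec import TargetType

# ask the server for its engine status and the table of registered clients
response = runner.api.check_status(target_type=TargetType.SERVER)
print(response)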

Finally, if everything is fine, we can submit the job. A successful job submission should yield a message akin to this:

{'status': <APIStatus.SUCCESS: 'SUCCESS'>,
 'details': {'message': 'Submitted job: 4175fb1e-6753-49a6-979c-0ea411beb499',
             'job_id': '4175fb1e-6753-49a6-979c-0ea411beb499'},
 'raw': {'time': '2023-05-22 13:29:30.663453',
         'data': [{'type': 'string',
                   'data': 'Submitted job: 4175fb1e-6753-49a6-979c-0ea411beb499'},
                  {'type': 'success', 'data': ''}],
         'status': <APIStatus.SUCCESS: 'SUCCESS'>}}
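
For reference, the submission itself essentially amounts to a single admin API call; a sketch, with a placeholder job folder path:

# submit the NVFlare job folder prepared under your project (placeholder path)
response = runner.api.submit_job(job_folder="jobs/my_fl_job")
print(response)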

How to change the provisioned files?

Sometimes you need to adjust timeouts or other settings for a specific client. These are mostly configurable at the client level. Our system provisions in bulk and submits a container, which has one downside: if you are not careful, generating the Flare provision kits again will overwrite your manual customizations.

The generated kits are located within /dev/nvflare/workspace/prod_00 and can be edited there. However, beware: these edits must take place before the Submit Deployment Packages to Flare Clients section.

How and when to change the NVFlare project files?

In general, you need to add your project folder into the dev/nvflare/ folder in order to be able to submit it to the machines. Project files are submitted to the machines during the Submit Deployment Packages to Flare Clients section, thus all edits done afterwards are ignored.

How to track the experiment after submit?

Well, there are several ways. You can scroll down and click the first cell under the "Check status" headline. This will print the jobs, their status, time of execution etc., but won't give you more granular information, such as which epoch is currently running.

Another option is to navigate to the logs of the server. The most useful keyword is epoch, as it allows you to deduce which epoch is taking place.

The last option is to navigate to the metrics tab of the server. Here you can see the metrics from the different clients and, depending on the graph, see which epoch is running and which clients have already finished it. This is, however, only applicable after the first epoch (a lack of graphs or statistics in the metrics means the first submissions from the clients haven't been received yet).

How to get results from the training?

We are currently working on a more robust way to get the models to automatically upload to AML studio. For now, the easiest way to retrieve the global model is to use the cell from the "Download Model" subsection. The cell has a parameter, which is really just a string representing a job.

How to abort NVFLARE job without restarting clients and server?

If you uploaded jobs to the server and want to cancel the last one and run a new one, you can do so. Just use the cell under "Cancel the created job" with your job id.
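
A hedged sketch of what that cell typically boils down to, using the job id returned at submission time (the id below is the example one from the output above):

# abort the running NVFlare job by its id
response = runner.api.abort_job(job_id="4175fb1e-6753-49a6-979c-0ea411beb499")
print(response)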

However, if you launched the wrong AML job, or launched it on the wrong AML machine, you will generally need to manually stop all the jobs and restart the server VM in order to be able to resubmit.

How to investigate current status of job/problems?

After opening the job, there is a tab called Outputs + logs. Depending on when the job failed, different files may be of interest. If it failed during image creation, it is certainly the azureml-logs/20_image_build_log.txt file. Otherwise you want to take a look into the folder called user_logs. This is where the standard output of the machine is redirected, and either training logs or crash output will be present there.